{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "intro-header-v5",
   "metadata": {},
   "source": [
    "# Learning RAG: Testing Configurations Step-by-Step\n",
    "## An Educational End-to-End Pipeline with Enhanced Evaluation\n",
    "\n",
    "This notebook is designed as a learning project to understand how different settings impact Retrieval-Augmented Generation (RAG) systems. We'll build and test a pipeline step-by-step using the **Nebius AI API**.\n",
    "\n",
    "**What we'll learn:**\n",
    "*   How text chunking (`chunk_size`, `chunk_overlap`) affects what the RAG system retrieves.\n",
    "*   How the number of retrieved documents (`top_k`) influences the context provided to the LLM.\n",
    "*   The difference between three common RAG strategies (Simple, Query Rewrite, Rerank).\n",
    "*   How to use an LLM (like Nebius AI) to automatically evaluate the quality of generated answers using multiple metrics: **Faithfulness**, **Relevancy**, and **Semantic Similarity** to a ground truth answer.\n",
    "*   How to combine these metrics into an average score for easier comparison.\n",
    "\n",
    "We'll focus on understanding *why* we perform each step and observing the outcomes clearly, with detailed explanations and commented code."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "toc-v5",
   "metadata": {},
   "source": [
    "### Table of Contents\n",
    "1.  **Setup: Installing Libraries**: Get the necessary tools.\n",
    "2.  **Setup: Importing Libraries**: Bring the tools into our workspace.\n",
    "3.  **Configuration: Setting Up Our Experiment**: Define API details, models, evaluation prompts, and parameters to test.\n",
    "4.  **Input Data: The Knowledge Source & Our Question**: Define the documents the RAG system will learn from and the question we'll ask.\n",
    "5.  **Core Component: Text Chunking Function**: Create a function to break documents into smaller pieces.\n",
    "6.  **Core Component: Connecting to Nebius AI**: Establish the connection to use Nebius models.\n",
    "7.  **Core Component: Cosine Similarity Function**: Create a function to measure semantic similarity between texts.\n",
    "8.  **The Experiment: Iterating Through Configurations**: The main loop where we test different settings.\n",
    "    *   8.1 Processing a Chunking Configuration (Chunk, Embed, Index)\n",
    "    *   8.2 Testing RAG Strategies for a `top_k` Value\n",
    "    *   8.3 Running & Evaluating a Single RAG Strategy (including Similarity)\n",
    "9.  **Analysis: Reviewing the Results**: Use Pandas to organize and display the results.\n",
    "10. **Conclusion: What Did We Learn?**: Reflect on the findings and potential next steps."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "setup-install-v5",
   "metadata": {},
   "source": [
    "### 1. Setup: Installing Libraries\n",
    "\n",
    "First, we need to install the Python packages required for this notebook. \n",
    "- `openai`: Interacts with the Nebius API (which uses an OpenAI-compatible interface).\n",
    "- `pandas`: For creating and managing data tables (DataFrames).\n",
    "- `numpy`: For numerical operations, especially with vectors (embeddings).\n",
    "- `faiss-cpu`: For efficient similarity search on vectors (the retrieval part).\n",
    "- `ipywidgets`, `tqdm`: For displaying progress bars in Jupyter.\n",
    "- `scikit-learn`: For calculating cosine similarity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "install-libs-v5",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Install libraries (run this cell only once if needed)\n",
    "# !pip install openai pandas numpy faiss-cpu ipywidgets tqdm scikit-learn"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "install-note-v5",
   "metadata": {},
   "source": [
    "**Remember!** After the installation finishes, you might need to **Restart the Kernel** (or Runtime) for Jupyter/Colab to recognize the newly installed packages. Look for this option in the menu (e.g., 'Kernel' -> 'Restart Kernel...' or 'Runtime' -> 'Restart Runtime')."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "setup-import-v5",
   "metadata": {},
   "source": [
    "### 2. Setup: Importing Libraries\n",
    "\n",
    "With the libraries installed, we import them into our Python environment to make their functions available."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "import-code-v5",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Libraries imported successfully!\n"
     ]
    }
   ],
   "source": [
    "import os                     # For accessing environment variables (like API keys)\n",
    "import time                   # For timing operations\n",
    "import re                     # For regular expressions (text cleaning)\n",
    "import warnings               # For controlling warning messages\n",
    "import itertools              # For creating parameter combinations easily\n",
    "import getpass                # For securely prompting for API keys if not set\n",
    "\n",
    "import numpy as np            # Numerical library for vector operations\n",
    "import pandas as pd           # Data manipulation library for tables (DataFrames)\n",
    "import faiss                  # Library for fast vector similarity search\n",
    "from openai import OpenAI     # Client library for Nebius API interaction\n",
    "from tqdm.notebook import tqdm # Library for displaying progress bars\n",
    "from sklearn.metrics.pairwise import cosine_similarity # For calculating similarity score\n",
    "\n",
    "# Configure display options for Pandas DataFrames for better readability\n",
    "pd.set_option('display.max_colwidth', 150) # Show more text content in table cells\n",
    "pd.set_option('display.max_rows', 100)     # Display more rows in tables\n",
    "warnings.filterwarnings('ignore', category=FutureWarning) # Suppress specific non-critical warnings\n",
    "\n",
    "print(\"Libraries imported successfully!\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "config-params-v5",
   "metadata": {},
   "source": [
    "### 3. Configuration: Setting Up Our Experiment\n",
    "\n",
    "Here, we define all the settings and parameters for our experiment directly as Python variables. This makes it easy to see and modify the configuration in one place.\n",
    "\n",
    "**Key Configuration Areas:**\n",
    "*   **Nebius API Details:** Credentials and model identifiers for connecting to Nebius AI.\n",
    "*   **LLM Settings:** Parameters controlling the behavior of the language model during answer generation (e.g., `temperature` for creativity).\n",
    "*   **Evaluation Prompts:** The specific instructions (prompts) given to the LLM when it acts as an evaluator for Faithfulness and Relevancy.\n",
    "*   **Tuning Parameters:** The different values for chunk size, overlap, and retrieval `top_k` that we want to systematically test.\n",
    "*   **Reranking Setting:** Configuration for the simulated reranking strategy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "config-setup-v5",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--- Configuration Check --- \n",
      "Attempting to load Nebius API Key from environment variable 'NEBIUS_API_KEY'...\n",
      "Nebius API Key loaded successfully from environment variable.\n",
      "Models: Embed='BAAI/bge-multilingual-gemma2', Gen='deepseek-ai/DeepSeek-V3', Eval='deepseek-ai/DeepSeek-V3'\n",
      "Chunk Sizes to Test: [150, 250]\n",
      "Overlaps to Test: [30, 50]\n",
      "Top-K Values to Test: [3, 5]\n",
      "Generation Temp: 0.1, Max Tokens: 400\n",
      "Configuration ready.\n",
      "-------------------------\n"
     ]
    }
   ],
   "source": [
    "# --- NebiusAI API Configuration ---\n",
    "# It's best practice to store API keys as environment variables rather than hardcoding them.\n",
    "# Provide your actual key here or set it as an environment variable\n",
    "NEBIUS_API_KEY = os.getenv('NEBIUS_API_KEY', None)  # Load API key from environment variable\n",
    "if NEBIUS_API_KEY is None:\n",
    "    print(\"Warning: NEBIUS_API_KEY not set. Please set it in your environment variables or provide it directly in the code.\") \n",
    "NEBIUS_BASE_URL = \"https://api.studio.nebius.com/v1/\" \n",
    "NEBIUS_EMBEDDING_MODEL = \"BAAI/bge-multilingual-gemma2\"  # Model for converting text to vector embeddings\n",
    "NEBIUS_GENERATION_MODEL = \"deepseek-ai/DeepSeek-V3\"    # LLM for generating the final answers\n",
    "NEBIUS_EVALUATION_MODEL = \"deepseek-ai/DeepSeek-V3\"    # LLM used for evaluating the generated answers\n",
    "\n",
    "# --- Text Generation Parameters (for RAG answer generation) ---\n",
    "GENERATION_TEMPERATURE = 0.1  # Lower values (e.g., 0.1-0.3) make output more focused and deterministic, good for fact-based answers.\n",
    "GENERATION_MAX_TOKENS = 400   # Maximum number of tokens (roughly words/sub-words) in the generated answer.\n",
    "GENERATION_TOP_P = 0.9        # Nucleus sampling parameter (alternative to temperature, usually fine at default).\n",
    "\n",
    "# --- Evaluation Prompts (Instructions for the Evaluator LLM) ---\n",
    "# Faithfulness: Does the answer stay true to the provided context?\n",
    "FAITHFULNESS_PROMPT = \"\"\"\n",
    "System: You are an objective evaluator. Evaluate the faithfulness of the AI Response compared to the True Answer, considering only the information present in the True Answer as the ground truth.\n",
    "Faithfulness measures how accurately the AI response reflects the information in the True Answer, without adding unsupported facts or contradicting it.\n",
    "Score STRICTLY using a float between 0.0 and 1.0, based on this scale:\n",
    "- 0.0: Completely unfaithful, contradicts or fabricates information.\n",
    "- 0.1-0.4: Low faithfulness with significant inaccuracies or unsupported claims.\n",
    "- 0.5-0.6: Partially faithful but with noticeable inaccuracies or omissions.\n",
    "- 0.7-0.8: Mostly faithful with only minor inaccuracies or phrasing differences.\n",
    "- 0.9: Very faithful, slight wording differences but semantically aligned.\n",
    "- 1.0: Completely faithful, accurately reflects the True Answer.\n",
    "Respond ONLY with the numerical score.\n",
    "\n",
    "User:\n",
    "Query: {question}\n",
    "AI Response: {response}\n",
    "True Answer: {true_answer}\n",
    "Score:\"\"\"\n",
    "\n",
    "# Relevancy: Does the answer directly address the user's query?\n",
    "RELEVANCY_PROMPT = \"\"\"\n",
    "System: You are an objective evaluator. Evaluate the relevance of the AI Response to the specific User Query.\n",
    "Relevancy measures how well the response directly answers the user's question, avoiding unnecessary or off-topic information.\n",
    "Score STRICTLY using a float between 0.0 and 1.0, based on this scale:\n",
    "- 0.0: Not relevant at all.\n",
    "- 0.1-0.4: Low relevance, addresses a different topic or misses the core question.\n",
    "- 0.5-0.6: Partially relevant, answers only a part of the query or is tangentially related.\n",
    "- 0.7-0.8: Mostly relevant, addresses the main aspects of the query but might include minor irrelevant details.\n",
    "- 0.9: Highly relevant, directly answers the query with minimal extra information.\n",
    "- 1.0: Completely relevant, directly and fully answers the exact question asked.\n",
    "Respond ONLY with the numerical score.\n",
    "\n",
    "User:\n",
    "Query: {question}\n",
    "AI Response: {response}\n",
    "Score:\"\"\"\n",
    "\n",
    "# --- Parameters to Tune (The experimental variables) ---\n",
    "CHUNK_SIZES_TO_TEST = [150, 250]    # List of chunk sizes (in words) to experiment with.\n",
    "CHUNK_OVERLAPS_TO_TEST = [30, 50]   # List of chunk overlaps (in words) to experiment with.\n",
    "RETRIEVAL_TOP_K_TO_TEST = [3, 5]   # List of 'k' values (number of chunks to retrieve) to test.\n",
    "\n",
    "# --- Reranking Configuration (Only used for the Rerank strategy) ---\n",
    "RERANK_RETRIEVAL_MULTIPLIER = 3 # For simulated reranking: retrieve K * multiplier chunks initially.\n",
    "\n",
    "# --- Validate API Key --- \n",
    "print(\"--- Configuration Check --- \")\n",
    "print(f\"Attempting to load Nebius API Key from environment variable 'NEBIUS_API_KEY'...\")\n",
    "if not NEBIUS_API_KEY:\n",
    "    print(\"Nebius API Key not found in environment variables.\")\n",
    "    # Prompt the user securely if the key is not found.\n",
    "    NEBIUS_API_KEY = getpass.getpass(\"Please enter your Nebius API Key: \")\n",
    "else:\n",
    "    print(\"Nebius API Key loaded successfully from environment variable.\")\n",
    "\n",
    "# Print a summary of key settings for verification\n",
    "print(f\"Models: Embed='{NEBIUS_EMBEDDING_MODEL}', Gen='{NEBIUS_GENERATION_MODEL}', Eval='{NEBIUS_EVALUATION_MODEL}'\")\n",
    "print(f\"Chunk Sizes to Test: {CHUNK_SIZES_TO_TEST}\")\n",
    "print(f\"Overlaps to Test: {CHUNK_OVERLAPS_TO_TEST}\")\n",
    "print(f\"Top-K Values to Test: {RETRIEVAL_TOP_K_TO_TEST}\")\n",
    "print(f\"Generation Temp: {GENERATION_TEMPERATURE}, Max Tokens: {GENERATION_MAX_TOKENS}\")\n",
    "print(\"Configuration ready.\")\n",
    "print(\"-\" * 25)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "input-data-v5",
   "metadata": {},
   "source": [
    "### 4. Input Data: The Knowledge Source & Our Question\n",
    "\n",
    "Every RAG system needs a knowledge base to draw information from. Here, we define:\n",
    "*   `corpus_texts`: A list of strings, where each string is a document containing information (in this case, about renewable energy sources).\n",
    "*   `test_query`: The specific question we want the RAG system to answer using the `corpus_texts`.\n",
    "*   `true_answer_for_query`: A carefully crafted 'ground truth' answer based *only* on the information available in `corpus_texts`. This is essential for evaluating Faithfulness and Semantic Similarity accurately."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "corpus-def-v5",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Loaded 5 documents into our corpus.\n",
      "Test Query: 'Compare the consistency and environmental impact of solar power versus hydropower.'\n",
      "Reference (True) Answer for evaluation: 'Solar power's consistency varies with weather and time of day, requiring storage like batteries. Hydropower is generally reliable, but large dams have significant environmental impacts on ecosystems and communities, unlike solar power's primary impact being land use for panels.'\n",
      "Input data is ready.\n",
      "-------------------------\n"
     ]
    }
   ],
   "source": [
    "# Our knowledge base: A list of text documents about renewable energy\n",
    "corpus_texts = [\n",
    "    \"Solar power uses PV panels or CSP systems. PV converts sunlight directly to electricity. CSP uses mirrors to heat fluid driving a turbine. It's clean but varies with weather/time. Storage (batteries) is key for consistency.\", # Doc 0\n",
    "    \"Wind energy uses turbines in wind farms. It's sustainable with low operating costs. Wind speed varies, siting can be challenging (visual/noise). Offshore wind is stronger and more consistent.\", # Doc 1\n",
    "    \"Hydropower uses moving water, often via dams spinning turbines. Reliable, large-scale power with flood control/water storage benefits. Big dams harm ecosystems and displace communities. Run-of-river is smaller, less disruptive.\", # Doc 2\n",
    "    \"Geothermal energy uses Earth's heat via steam/hot water for turbines. Consistent 24/7 power, small footprint. High initial drilling costs, sites are geographically limited.\", # Doc 3\n",
    "    \"Biomass energy from organic matter (wood, crops, waste). Burned directly or converted to biofuels. Uses waste, provides dispatchable power. Requires sustainable sourcing. Combustion releases emissions (carbon-neutral if balanced by regrowth).\" # Doc 4\n",
    "]\n",
    "\n",
    "# The question we will ask the RAG system\n",
    "test_query = \"Compare the consistency and environmental impact of solar power versus hydropower.\"\n",
    "\n",
    "# !!! CRITICAL: The 'True Answer' MUST be derivable ONLY from the corpus_texts above !!!\n",
    "# This is our ground truth for evaluation.\n",
    "true_answer_for_query = \"Solar power's consistency varies with weather and time of day, requiring storage like batteries. Hydropower is generally reliable, but large dams have significant environmental impacts on ecosystems and communities, unlike solar power's primary impact being land use for panels.\"\n",
    "\n",
    "print(f\"Loaded {len(corpus_texts)} documents into our corpus.\")\n",
    "print(f\"Test Query: '{test_query}'\")\n",
    "print(f\"Reference (True) Answer for evaluation: '{true_answer_for_query}'\")\n",
    "print(\"Input data is ready.\")\n",
    "print(\"-\" * 25)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "chunking-func-md-v5",
   "metadata": {},
   "source": [
    "### 5. Core Component: Text Chunking Function\n",
    "\n",
    "LLMs and embedding models have limits on the amount of text they can process at once. Furthermore, retrieval works best when searching over smaller, focused pieces of text rather than entire large documents. \n",
    "\n",
    "**Chunking** is the process of splitting large documents into smaller, potentially overlapping, segments.\n",
    "\n",
    "- **`chunk_size`**: Determines the approximate size (here, in words) of each chunk.\n",
    "- **`chunk_overlap`**: Specifies how many words from the end of one chunk should also be included at the beginning of the next chunk. This helps prevent relevant information from being lost if it spans across the boundary between two chunks.\n",
    "\n",
    "We define a function `chunk_text` to perform this splitting based on word counts."
   ]
  },
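  {
   "cell_type": "markdown",
   "id": "chunking-stride-note",
   "metadata": {},
   "source": [
    "As a worked illustration of the `chunk_size` / `chunk_overlap` description above (a sketch with made-up numbers; the symbols $c$, $o$, and $N$ are shorthand introduced here for `chunk_size`, `chunk_overlap`, and the document's word count), chunk $i$ covers the zero-based, half-open word range\n",
    "\n",
    "$$[\\,s_i,\\ e_i\\,), \\qquad s_i = i\\,(c - o), \\qquad e_i = \\min(s_i + c,\\ N).$$\n",
    "\n",
    "With $c = 150$ and $o = 30$ the stride is $c - o = 120$, so a hypothetical 400-word document would be split into\n",
    "\n",
    "$$[0, 150),\\quad [120, 270),\\quad [240, 390),\\quad [360, 400).$$\n",
    "\n",
    "Each chunk repeats the last 30 words of the one before it. Note that every document in `corpus_texts` is much shorter than 150 words, so with the chunk sizes tested here each document ends up as a single chunk."
   ]
  },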
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "chunking-func-v5",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Defining the 'chunk_text' function.\n",
      "Test chunking on first doc (size=150 words, overlap=30 words): Created 1 chunks.\n",
      "First sample chunk:\n",
      "'Solar power uses PV panels or CSP systems. PV converts sunlight directly to electricity. CSP uses mirrors to heat fluid driving a turbine. It's clean but varies with weather/time. Storage (batteries) is key for consistency.'\n",
      "-------------------------\n"
     ]
    }
   ],
   "source": [
    "def chunk_text(text, chunk_size, chunk_overlap):\n",
    "    \"\"\"Splits a single text document into overlapping chunks based on word count.\n",
    "\n",
    "    Args:\n",
    "        text (str): The input text to be chunked.\n",
    "        chunk_size (int): The target number of words per chunk.\n",
    "        chunk_overlap (int): The number of words to overlap between consecutive chunks.\n",
    "\n",
    "    Returns:\n",
    "        list[str]: A list of text chunks.\n",
    "    \"\"\"\n",
    "    words = text.split()      # Split the text into a list of individual words\n",
    "    total_words = len(words) # Calculate the total number of words in the text\n",
    "    chunks = []             # Initialize an empty list to store the generated chunks\n",
    "    start_index = 0         # Initialize the starting word index for the first chunk\n",
    "\n",
    "    # --- Input Validation ---\n",
    "    # Ensure chunk_size is a positive integer.\n",
    "    if not isinstance(chunk_size, int) or chunk_size <= 0:\n",
    "        print(f\"  Warning: Invalid chunk_size ({chunk_size}). Must be a positive integer. Returning the whole text as one chunk.\")\n",
    "        return [text]\n",
    "    # Ensure chunk_overlap is a non-negative integer smaller than chunk_size.\n",
    "    if not isinstance(chunk_overlap, int) or chunk_overlap < 0:\n",
    "        print(f\"  Warning: Invalid chunk_overlap ({chunk_overlap}). Must be a non-negative integer. Setting overlap to 0.\")\n",
    "        chunk_overlap = 0\n",
    "    if chunk_overlap >= chunk_size:\n",
    "        # If overlap is too large, adjust it to a reasonable fraction (e.g., 1/3) of chunk_size\n",
    "        # This prevents infinite loops or nonsensical chunking.\n",
    "        adjusted_overlap = chunk_size // 3\n",
    "        print(f\"  Warning: chunk_overlap ({chunk_overlap}) >= chunk_size ({chunk_size}). Adjusting overlap to {adjusted_overlap}.\")\n",
    "        chunk_overlap = adjusted_overlap\n",
    "\n",
    "    # --- Chunking Loop ---\n",
    "    # Continue chunking as long as the start_index is within the bounds of the text\n",
    "    while start_index < total_words:\n",
    "        # Determine the end index for the current chunk.\n",
    "        # It's the minimum of (start + chunk_size) and the total number of words.\n",
    "        end_index = min(start_index + chunk_size, total_words)\n",
    "        \n",
    "        # Extract the words for the current chunk and join them back into a single string.\n",
    "        current_chunk_text = \" \".join(words[start_index:end_index])\n",
    "        chunks.append(current_chunk_text) # Add the generated chunk to the list\n",
    "        \n",
    "        # Calculate the starting index for the *next* chunk.\n",
    "        # Move forward by (chunk_size - chunk_overlap) words.\n",
    "        next_start_index = start_index + chunk_size - chunk_overlap\n",
    "        \n",
    "        # --- Safety Checks ---\n",
    "        # Check 1: Prevent infinite loops if overlap causes no progress.\n",
    "        # This can happen if chunk_size is very small or overlap is very large relative to chunk_size.\n",
    "        if next_start_index <= start_index:\n",
    "            if end_index == total_words: # If we are already at the end, we can safely break.\n",
    "                break\n",
    "            else: \n",
    "                # Force progress by moving forward by at least one word.\n",
    "                print(f\"  Warning: Chunking logic stuck (start={start_index}, next_start={next_start_index}). Forcing progress.\")\n",
    "                next_start_index = start_index + 1 \n",
    "                \n",
    "        # Check 2: If the calculated next start index is already at or beyond the total number of words, we are done.\n",
    "        if next_start_index >= total_words:\n",
    "            break\n",
    "            \n",
    "        # Move the start_index to the calculated position for the next iteration.\n",
    "        start_index = next_start_index\n",
    "        \n",
    "    return chunks # Return the complete list of text chunks\n",
    "\n",
    "# --- Quick Test ---\n",
    "# Test the function with the first document and sample parameters.\n",
    "print(\"Defining the 'chunk_text' function.\")\n",
    "sample_chunk_size = 150\n",
    "sample_overlap = 30\n",
    "sample_chunks = chunk_text(corpus_texts[0], sample_chunk_size, sample_overlap) \n",
    "print(f\"Test chunking on first doc (size={sample_chunk_size} words, overlap={sample_overlap} words): Created {len(sample_chunks)} chunks.\")\n",
    "if sample_chunks: # Only print if chunks were created\n",
    "    print(f\"First sample chunk:\\n'{sample_chunks[0]}'\")\n",
    "print(\"-\" * 25)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "client-setup-md-v5",
   "metadata": {},
   "source": [
    "### 6. Core Component: Connecting to Nebius AI\n",
    "\n",
    "To use the Nebius AI models (for embedding, generation, evaluation), we need to establish a connection to their API. We use the `openai` Python library, which provides a convenient way to interact with OpenAI-compatible APIs like Nebius.\n",
    "\n",
    "We instantiate an `OpenAI` client object, providing our API key and the specific Nebius API endpoint URL."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "client-setup-v5",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Attempting to initialize the Nebius AI client...\n",
      "Nebius AI client initialized successfully. Ready to make API calls.\n",
      "Client setup step complete.\n",
      "-------------------------\n"
     ]
    }
   ],
   "source": [
    "client = None # Initialize client variable to None globally\n",
    "\n",
    "print(\"Attempting to initialize the Nebius AI client...\")\n",
    "try:\n",
    "    # Check if the API key is actually available before creating the client\n",
    "    if not NEBIUS_API_KEY:\n",
    "        raise ValueError(\"Nebius API Key is missing. Cannot initialize client.\")\n",
    "        \n",
    "    # Create the OpenAI client object, configured for the Nebius API.\n",
    "    client = OpenAI(\n",
    "        api_key=NEBIUS_API_KEY,     # Pass the API key loaded earlier\n",
    "        base_url=NEBIUS_BASE_URL  # Specify the Nebius API endpoint\n",
    "    )\n",
    "    \n",
    "    # Optional: Add a quick test call to verify the client connection,\n",
    "    # e.g., listing models (if supported and desired). This might incur costs.\n",
    "    # try:\n",
    "    #     client.models.list() \n",
    "    #     print(\"Client connection verified by listing models.\")\n",
    "    # except Exception as test_e:\n",
    "    #     print(f\"Warning: Could not verify client connection with test call: {test_e}\")\n",
    "    \n",
    "    print(\"Nebius AI client initialized successfully. Ready to make API calls.\")\n",
    "    \n",
    "except Exception as e:\n",
    "    # Catch any errors during client initialization (e.g., invalid key, network issues)\n",
    "    print(f\"Error initializing Nebius AI client: {e}\")\n",
    "    print(\"!!! Execution cannot proceed without a valid client. Please check your API key and network connection. !!!\")\n",
    "    # Setting client back to None to prevent further attempts if initialization failed\n",
    "    client = None \n",
    "\n",
    "print(\"Client setup step complete.\")\n",
    "print(\"-\" * 25)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "similarity-func-md-v5",
   "metadata": {},
   "source": [
    "### 7. Core Component: Cosine Similarity Function\n",
    "\n",
    "To evaluate how semantically similar the generated answer is to our ground truth answer, we use **Cosine Similarity**. This metric measures the cosine of the angle between two vectors (in our case, the embedding vectors of the two answers).\n",
    "\n",
    "- A score of **1** means the vectors point in the same direction (maximum similarity).\n",
    "- A score of **0** means the vectors are orthogonal (no similarity).\n",
    "- A score of **-1** means the vectors point in opposite directions (maximum dissimilarity).\n",
    "\n",
    "For text embeddings, scores typically range from 0 to 1, where higher values indicate greater semantic similarity.\n",
    "\n",
    "We define a function `calculate_cosine_similarity` that takes two text strings, generates their embeddings using the Nebius client, and returns their cosine similarity score."
   ]
  },
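  {
   "cell_type": "markdown",
   "id": "cosine-formula-note",
   "metadata": {},
   "source": [
    "For reference, the cosine similarity described above can be written out as follows (a sketch in standard notation; $\\mathbf{a}$ and $\\mathbf{b}$ stand for the two embedding vectors and are names introduced here for illustration):\n",
    "\n",
    "$$\\cos(\\theta) = \\frac{\\mathbf{a} \\cdot \\mathbf{b}}{\\lVert \\mathbf{a} \\rVert \\, \\lVert \\mathbf{b} \\rVert} = \\frac{\\sum_i a_i b_i}{\\sqrt{\\sum_i a_i^2}\\ \\sqrt{\\sum_i b_i^2}}$$\n",
    "\n",
    "A toy example with made-up three-dimensional vectors (real embeddings have far more dimensions): for $\\mathbf{a} = (1, 2, 2)$ and $\\mathbf{b} = (2, 1, 2)$,\n",
    "\n",
    "$$\\mathbf{a} \\cdot \\mathbf{b} = 2 + 2 + 4 = 8, \\qquad \\lVert \\mathbf{a} \\rVert = \\lVert \\mathbf{b} \\rVert = \\sqrt{9} = 3, \\qquad \\cos(\\theta) = \\frac{8}{3 \\cdot 3} \\approx 0.89.$$\n",
    "\n",
    "Because the score depends only on the angle between the vectors, scaling either vector by a positive constant leaves it unchanged."
   ]
  },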
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "similarity-func-v5",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Defining the 'calculate_cosine_similarity' function.\n",
      "Testing similarity function: Similarity between 'apple' and 'orange' = 0.77\n",
      "-------------------------\n"
     ]
    }
   ],
   "source": [
    "def calculate_cosine_similarity(text1, text2, client, embedding_model):\n",
    "    \"\"\"Calculates cosine similarity between the embeddings of two texts.\n",
    "\n",
    "    Args:\n",
    "        text1 (str): The first text string.\n",
    "        text2 (str): The second text string.\n",
    "        client (OpenAI): The initialized Nebius AI client.\n",
    "        embedding_model (str): The name of the embedding model to use.\n",
    "\n",
    "    Returns:\n",
    "        float: The cosine similarity score (between 0.0 and 1.0), or 0.0 if an error occurs.\n",
    "    \"\"\"\n",
    "    if not client:\n",
    "        print(\"  Error: Nebius client not available for similarity calculation.\")\n",
    "        return 0.0\n",
    "    if not text1 or not text2:\n",
    "        # Handle cases where one or both texts might be empty or None\n",
    "        return 0.0\n",
    "        \n",
    "    try:\n",
    "        # Generate embeddings for both texts in a single API call if possible\n",
    "        response = client.embeddings.create(model=embedding_model, input=[text1, text2])\n",
    "        \n",
    "        # Extract the embedding vectors\n",
    "        embedding1 = np.array(response.data[0].embedding)\n",
    "        embedding2 = np.array(response.data[1].embedding)\n",
    "        \n",
    "        # Reshape vectors to be 2D arrays as expected by cosine_similarity\n",
    "        embedding1 = embedding1.reshape(1, -1)\n",
    "        embedding2 = embedding2.reshape(1, -1)\n",
    "        \n",
    "        # Calculate cosine similarity using scikit-learn\n",
    "        # cosine_similarity returns a 2D array, e.g., [[similarity]], so we extract the value.\n",
    "        similarity_score = cosine_similarity(embedding1, embedding2)[0][0]\n",
    "        \n",
    "        # Clamp the score between 0.0 and 1.0 for safety/consistency\n",
    "        return max(0.0, min(1.0, similarity_score))\n",
    "        \n",
    "    except Exception as e:\n",
    "        print(f\"  Error calculating cosine similarity: {e}\")\n",
    "        return 0.0 # Return 0.0 in case of any API or calculation errors\n",
    "\n",
    "# --- Quick Test ---\n",
    "print(\"Defining the 'calculate_cosine_similarity' function.\")\n",
    "if client: # Only run test if client is initialized\n",
    "    test_sim = calculate_cosine_similarity(\"apple\", \"orange\", client, NEBIUS_EMBEDDING_MODEL)\n",
    "    print(f\"Testing similarity function: Similarity between 'apple' and 'orange' = {test_sim:.2f}\")\n",
    "else:\n",
    "    print(\"Skipping similarity function test as Nebius client is not initialized.\")\n",
    "print(\"-\" * 25)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "main-loop-md-v5",
   "metadata": {},
   "source": [
    "### 8. The Experiment: Iterating Through Configurations\n",
    "\n",
    "This section contains the main experimental loop. We will systematically iterate through all combinations of the tuning parameters we defined earlier (`CHUNK_SIZES_TO_TEST`, `CHUNK_OVERLAPS_TO_TEST`, `RETRIEVAL_TOP_K_TO_TEST`).\n",
    "\n",
    "**Workflow for Each Parameter Combination:**\n",
    "\n",
    "1.  **Prepare Data (Chunking/Embedding/Indexing - Step 8.1):**\n",
    "    *   **Check if Re-computation Needed:** If the `chunk_size` or `chunk_overlap` has changed from the previous iteration, we need to re-process the corpus.\n",
    "    *   **Chunking:** Split all documents in `corpus_texts` using the current `chunk_size` and `chunk_overlap` via the `chunk_text` function.\n",
    "    *   **Embedding:** Convert each text chunk into a numerical vector (embedding) using the specified Nebius embedding model (`NEBIUS_EMBEDDING_MODEL`). We do this in batches for efficiency.\n",
    "    *   **Indexing:** Build a FAISS index (`IndexFlatL2`) from the generated embeddings. FAISS allows for very fast searching to find the chunks whose embeddings are most similar to the query embedding.\n",
    "    *   *Optimization:* If chunk settings haven't changed, we reuse the existing chunks, embeddings, and index from the previous iteration to save time and API calls.\n",
    "\n",
    "2.  **Test RAG Strategies (Step 8.2):**\n",
    "    *   For the current `top_k` value, run each of the defined RAG strategies:\n",
    "        *   **Simple RAG:** Retrieve `top_k` chunks based on similarity to the original query.\n",
    "        *   **Query Rewrite RAG:** First, ask the LLM to rewrite the original query to be potentially better for vector search. Then, retrieve `top_k` chunks based on similarity to the *rewritten* query.\n",
    "        *   **Rerank RAG (Simulated):** Retrieve more chunks initially (`top_k * RERANK_RETRIEVAL_MULTIPLIER`). Then, *simulate* reranking by simply taking the top `top_k` results from this larger initial set. (A real implementation would use a more sophisticated reranking model).\n",
    "\n",
    "3.  **Evaluate & Store Results (Step 8.3 within `run_and_evaluate`):**\n",
    "    *   For each strategy run:\n",
    "        *   **Retrieve:** Find the relevant chunk indices using the FAISS index.\n",
    "        *   **Generate:** Construct a prompt containing the retrieved chunk(s) as context and the *original* `test_query`. Send this to the Nebius generation model (`NEBIUS_GENERATION_MODEL`) to get the final answer.\n",
    "        *   **Evaluate (Faithfulness):** Use the LLM evaluator (`NEBIUS_EVALUATION_MODEL`) with the `FAITHFULNESS_PROMPT` to score how well the generated answer aligns with the `true_answer_for_query`.\n",
    "        *   **Evaluate (Relevancy):** Use the LLM evaluator with the `RELEVANCY_PROMPT` to score how well the generated answer addresses the `test_query`.\n",
    "        *   **Evaluate (Similarity):** Use our `calculate_cosine_similarity` function to get the semantic similarity score between the generated answer and the `true_answer_for_query`.\n",
    "        *   **Calculate Average Score:** Compute the average of Faithfulness, Relevancy, and Similarity scores.\n",
    "        *   **Record:** Store all parameters (`chunk_size`, `overlap`, `top_k`, `strategy`), the retrieved indices, the rewritten query (if applicable), the generated answer, the individual scores, the average score, and the execution time for this specific run.\n",
    "\n",
    "We use `tqdm` to display a progress bar for the outer loop iterating through parameter combinations."
   ]
  },
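  {
   "cell_type": "markdown",
   "id": "rerank-sketch-v5",
   "metadata": {},
   "source": [
    "The simulated rerank in the workflow above can be checked with a small standalone sketch. It assumes random unit vectors in place of real embeddings and a multiplier of 3 in place of `RERANK_RETRIEVAL_MULTIPLIER`; `l2_top_k` and `normalize` are illustrative helpers, not functions from this notebook.\n",
    "\n",
    "```python\n",
    "import math\n",
    "import random\n",
    "\n",
    "def l2_top_k(query, vectors, k):\n",
    "    # Exact nearest neighbours by squared L2 distance, in the order a flat L2 index would return them.\n",
    "    dists = [sum((q - v) ** 2 for q, v in zip(query, vec)) for vec in vectors]\n",
    "    return sorted(range(len(vectors)), key=lambda i: dists[i])[:k]\n",
    "\n",
    "def normalize(vec):\n",
    "    # Scale a vector to unit length.\n",
    "    norm = math.sqrt(sum(x * x for x in vec))\n",
    "    return [x / norm for x in vec]\n",
    "\n",
    "random.seed(0)\n",
    "# Stand-ins for chunk embeddings and the query embedding (assumption: 50 chunks, 8 dimensions).\n",
    "vectors = [normalize([random.gauss(0, 1) for _ in range(8)]) for _ in range(50)]\n",
    "query = normalize([random.gauss(0, 1) for _ in range(8)])\n",
    "\n",
    "top_k, multiplier = 3, 3  # multiplier is an assumed value for RERANK_RETRIEVAL_MULTIPLIER\n",
    "simple = l2_top_k(query, vectors, top_k)\n",
    "# Simulated rerank: fetch a larger candidate set, then keep only the first top_k.\n",
    "simulated_rerank = l2_top_k(query, vectors, top_k * multiplier)[:top_k]\n",
    "\n",
    "print('Simple RAG indices:      ', simple)\n",
    "print('Simulated rerank indices:', simulated_rerank)\n",
    "```\n",
    "\n",
    "Both lists are identical by construction, since truncating a longer sorted list gives the same prefix. Differences between the Simple and Rerank rows in the results therefore come from generation and evaluation variance, not from retrieval."
   ]
  },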
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "main-loop-exec-v5",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "=== Starting RAG Experiment Loop ===\n",
      "\n",
      "Total parameter combinations to test: 8\n"
     ]
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "29485f798215461b910fc4eb4c8546d7",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Testing Configurations:   0%|          | 0/8 [00:00<?, ?it/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "    Finished: Simple RAG (C=150, O=30, K=3). AvgScore=0.89, Time=609.06s\n",
      "    Finished: Query Rewrite RAG (C=150, O=30, K=3). AvgScore=0.89, Time=10.36s\n",
      "    Finished: Rerank RAG (Simulated) (C=150, O=30, K=3). AvgScore=0.89, Time=9.53s\n",
      "    Finished: Simple RAG (C=150, O=30, K=5). AvgScore=0.89, Time=8.40s\n",
      "    Finished: Query Rewrite RAG (C=150, O=30, K=5). AvgScore=0.89, Time=8.36s\n",
      "    Finished: Rerank RAG (Simulated) (C=150, O=30, K=5). AvgScore=0.89, Time=8.34s\n",
      "    Finished: Simple RAG (C=150, O=50, K=3). AvgScore=0.89, Time=9.78s\n",
      "    Finished: Query Rewrite RAG (C=150, O=50, K=3). AvgScore=0.89, Time=9.68s\n",
      "    Finished: Rerank RAG (Simulated) (C=150, O=50, K=3). AvgScore=0.89, Time=8.43s\n",
      "    Finished: Simple RAG (C=150, O=50, K=5). AvgScore=0.89, Time=9.74s\n",
      "    Finished: Query Rewrite RAG (C=150, O=50, K=5). AvgScore=0.89, Time=9.39s\n",
      "    Finished: Rerank RAG (Simulated) (C=150, O=50, K=5). AvgScore=0.89, Time=8.53s\n",
      "    Finished: Simple RAG (C=250, O=30, K=3). AvgScore=0.89, Time=9.36s\n",
      "    Finished: Query Rewrite RAG (C=250, O=30, K=3). AvgScore=0.89, Time=8.36s\n",
      "    Finished: Rerank RAG (Simulated) (C=250, O=30, K=3). AvgScore=0.89, Time=9.56s\n",
      "    Finished: Simple RAG (C=250, O=30, K=5). AvgScore=0.89, Time=8.77s\n",
      "    Finished: Query Rewrite RAG (C=250, O=30, K=5). AvgScore=0.89, Time=9.63s\n",
      "    Finished: Rerank RAG (Simulated) (C=250, O=30, K=5). AvgScore=0.89, Time=6.63s\n",
      "    Finished: Simple RAG (C=250, O=50, K=3). AvgScore=0.90, Time=8.98s\n",
      "    Finished: Query Rewrite RAG (C=250, O=50, K=3). AvgScore=0.90, Time=6.55s\n",
      "    Finished: Rerank RAG (Simulated) (C=250, O=50, K=3). AvgScore=0.89, Time=41.23s\n",
      "    Finished: Simple RAG (C=250, O=50, K=5). AvgScore=0.89, Time=6.93s\n",
      "    Finished: Query Rewrite RAG (C=250, O=50, K=5). AvgScore=0.89, Time=6.11s\n",
      "    Finished: Rerank RAG (Simulated) (C=250, O=50, K=5). AvgScore=0.89, Time=7.09s\n",
      "\n",
      "=== RAG Experiment Loop Finished ===\n",
      "-------------------------\n"
     ]
    }
   ],
   "source": [
    "# List to store the detailed results from each experimental run\n",
    "all_results = []\n",
    "\n",
    "# --- Cache variables for Chunking/Embedding/Indexing --- \n",
    "# These variables help us avoid redundant computations when only 'top_k' changes.\n",
    "last_chunk_size = -1      # Stores the chunk_size used in the previous iteration\n",
    "last_overlap = -1         # Stores the chunk_overlap used in the previous iteration\n",
    "current_index = None      # Holds the active FAISS index\n",
    "current_chunks = []       # Holds the list of text chunks for the active settings\n",
    "current_embeddings = None # Holds the numpy array of embeddings for the active chunks\n",
    "\n",
    "# Check if the Nebius client was initialized successfully before starting\n",
    "if not client:\n",
    "    print(\"STOPPING: Nebius AI client is not initialized. Cannot run experiment.\")\n",
    "else:\n",
    "    print(\"=== Starting RAG Experiment Loop ===\\n\")\n",
    "    \n",
    "    # Create all possible combinations of the tuning parameters\n",
    "    param_combinations = list(itertools.product(\n",
    "        CHUNK_SIZES_TO_TEST,\n",
    "        CHUNK_OVERLAPS_TO_TEST,\n",
    "        RETRIEVAL_TOP_K_TO_TEST\n",
    "    ))\n",
    "    \n",
    "    print(f\"Total parameter combinations to test: {len(param_combinations)}\")\n",
    "    \n",
    "    # --- Main Loop --- \n",
    "    # Iterate through each combination (chunk_size, chunk_overlap, top_k)\n",
    "    # Use tqdm to display a progress bar.\n",
    "    for chunk_size, chunk_overlap, top_k in tqdm(param_combinations, desc=\"Testing Configurations\"):\n",
    "        \n",
    "        # --- 8.1 Processing a Chunking Configuration --- \n",
    "        # Check if chunk settings have changed, requiring re-processing.\n",
    "        if chunk_size != last_chunk_size or chunk_overlap != last_overlap:\n",
    "            # Uncomment the line below for more verbose logging during execution\n",
    "            # print(f\"\\n--- Processing New Chunk Config: Size={chunk_size}, Overlap={chunk_overlap} ---\")\n",
    "            \n",
    "            # Update cache variables\n",
    "            last_chunk_size, last_overlap = chunk_size, chunk_overlap\n",
    "            # Reset index, chunks, and embeddings for the new configuration\n",
    "            current_index = None \n",
    "            current_chunks = []\n",
    "            current_embeddings = None\n",
    "            \n",
    "            # --- 8.1a: Chunking --- \n",
    "            # Apply the chunk_text function to each document in the corpus\n",
    "            try:\n",
    "                # print(\"  Chunking documents...\") # Uncomment for verbose logging\n",
    "                temp_chunks = []\n",
    "                for doc_index, doc in enumerate(corpus_texts):\n",
    "                    doc_chunks = chunk_text(doc, chunk_size, chunk_overlap)\n",
    "                    if not doc_chunks:\n",
    "                         print(f\"  Warning: No chunks created for document {doc_index} with size={chunk_size}, overlap={chunk_overlap}. Skipping document.\")\n",
    "                         continue\n",
    "                    temp_chunks.extend(doc_chunks)\n",
    "                \n",
    "                current_chunks = temp_chunks\n",
    "                if not current_chunks:\n",
    "                    # If no chunks were created at all (e.g., due to invalid settings or empty corpus)\n",
    "                    raise ValueError(\"No chunks were created for the current configuration.\")\n",
    "                # print(f\"    Created {len(current_chunks)} chunks total.\") # Uncomment for verbose logging\n",
    "            except Exception as e:\n",
    "                 print(f\"    ERROR during chunking for Size={chunk_size}, Overlap={chunk_overlap}: {e}. Skipping this configuration.\")\n",
    "                 last_chunk_size, last_overlap = -1, -1 # Reset cache state\n",
    "                 continue # Move to the next parameter combination\n",
    "            \n",
    "            # --- 8.1b: Embedding --- \n",
    "            # Generate embeddings for all chunks using the Nebius API.\n",
    "            # print(\"  Generating embeddings...\") # Uncomment for verbose logging\n",
    "            try:\n",
    "                batch_size = 32 # Process chunks in batches to avoid overwhelming the API or hitting limits.\n",
    "                temp_embeddings = [] # Temporary list to store embedding vectors\n",
    "                \n",
    "                # Loop through chunks in batches\n",
    "                for i in range(0, len(current_chunks), batch_size):\n",
    "                    batch_texts = current_chunks[i : min(i + batch_size, len(current_chunks))]\n",
    "                    # Make the API call to Nebius for the current batch\n",
    "                    response = client.embeddings.create(model=NEBIUS_EMBEDDING_MODEL, input=batch_texts)\n",
    "                    # Extract the embedding vectors from the API response\n",
    "                    batch_embeddings = [item.embedding for item in response.data]\n",
    "                    temp_embeddings.extend(batch_embeddings)\n",
    "                    time.sleep(0.05) # Add a small delay between batches to be polite to the API endpoint.\n",
    "                \n",
    "                # Convert the list of embeddings into a single NumPy array\n",
    "                current_embeddings = np.array(temp_embeddings)\n",
    "                # Basic validation check on the embeddings array\n",
    "                if current_embeddings.ndim != 2 or current_embeddings.shape[0] != len(current_chunks):\n",
    "                    raise ValueError(f\"Embeddings array shape mismatch. Expected ({len(current_chunks)}, dim), Got {current_embeddings.shape}\")\n",
    "                # print(f\"    Generated {current_embeddings.shape[0]} embeddings (Dimension: {current_embeddings.shape[1]}).\") # Uncomment for verbose logging\n",
    "\n",
    "            except Exception as e:\n",
    "                print(f\"    ERROR generating embeddings for Size={chunk_size}, Overlap={chunk_overlap}: {e}. Skipping this chunk config.\")\n",
    "                # Reset cache variables to indicate failure for this chunk setting\n",
    "                last_chunk_size, last_overlap = -1, -1 \n",
    "                current_chunks = []\n",
    "                current_embeddings = None\n",
    "                continue # Skip to the next parameter combination\n",
    "                \n",
    "            # --- 8.1c: Indexing --- \n",
    "            # Build a FAISS index for efficient similarity search.\n",
    "            # print(\"  Building FAISS search index...\") # Uncomment for verbose logging\n",
    "            try:\n",
    "                embedding_dim = current_embeddings.shape[1] # Get the dimensionality of the embeddings\n",
    "                # We use IndexFlatL2, which performs exact search using L2 (Euclidean) distance.\n",
    "                # For high-dimensional vectors from modern embedding models, cosine similarity often works better,\n",
    "                # but FAISS's IndexFlatIP (Inner Product) is closely related. For normalized embeddings (like many BGE models),\n",
    "                # L2 distance and Inner Product/Cosine Similarity ranking are equivalent.\n",
    "                current_index = faiss.IndexFlatL2(embedding_dim)\n",
    "                # Add the chunk embeddings to the index. FAISS requires float32 data type.\n",
    "                current_index.add(current_embeddings.astype('float32'))\n",
    "                \n",
    "                if current_index.ntotal == 0:\n",
    "                     raise ValueError(\"FAISS index is empty after adding vectors. No vectors were added.\")\n",
    "                # print(f\"    FAISS index ready with {current_index.ntotal} vectors.\") # Uncomment for verbose logging\n",
    "            except Exception as e:\n",
    "                print(f\"    ERROR building FAISS index for Size={chunk_size}, Overlap={chunk_overlap}: {e}. Skipping this chunk config.\")\n",
    "                # Reset variables to indicate failure\n",
    "                last_chunk_size, last_overlap = -1, -1\n",
    "                current_index = None\n",
    "                current_embeddings = None\n",
    "                current_chunks = []\n",
    "                continue # Skip to the next parameter combination\n",
    "        \n",
    "        # --- 8.2 Testing RAG Strategies for the Current Top-K --- \n",
    "        # If we reach this point, we have a valid index and chunks for the current chunk_size/overlap.\n",
    "        \n",
    "        # Check if the index and chunks are actually available (safety check)\n",
    "        if current_index is None or not current_chunks:\n",
    "            print(f\"    WARNING: Index or chunks not available for Size={chunk_size}, Overlap={chunk_overlap}. Skipping Top-K={top_k} test.\")\n",
    "            continue\n",
    "            \n",
    "        # --- 8.3 Running & Evaluating a Single RAG Strategy --- \n",
    "        # Define a nested function to perform the core RAG steps (retrieve, generate, evaluate)\n",
    "        # This avoids code repetition for each strategy.\n",
    "        def run_and_evaluate(strategy_name, query_to_use, k_retrieve, use_simulated_rerank=False):\n",
    "            # print(f\"    Starting: {strategy_name} (k={k_retrieve}) ...\") # Uncomment for verbose logging\n",
    "            run_start_time = time.time() # Record start time for timing the run\n",
    "            \n",
    "            # Initialize a dictionary to store results for this specific run\n",
    "            result = {\n",
    "                'chunk_size': chunk_size, 'overlap': chunk_overlap, 'top_k': k_retrieve, \n",
    "                'strategy': strategy_name,\n",
    "                'retrieved_indices': [], 'rewritten_query': None, 'answer': 'Error: Execution Failed',\n",
    "                'faithfulness': 0.0, 'relevancy': 0.0, 'similarity_score': 0.0, 'avg_score': 0.0, \n",
    "                'time_sec': 0.0\n",
    "            }\n",
    "            # Store the rewritten query if applicable\n",
    "            if strategy_name == \"Query Rewrite RAG\": \n",
    "                result['rewritten_query'] = query_to_use\n",
    "\n",
    "            try:\n",
    "                # --- Retrieval Step --- \n",
    "                k_for_search = k_retrieve # Number of chunks to retrieve initially\n",
    "                if use_simulated_rerank:\n",
    "                    # For simulated rerank, retrieve more candidates initially\n",
    "                    k_for_search = k_retrieve * RERANK_RETRIEVAL_MULTIPLIER\n",
    "                    # print(f\"      Rerank: Retrieving initial {k_for_search} candidates.\") # Uncomment for verbose logging\n",
    "                \n",
    "                # 1. Embed the query (original or rewritten)\n",
    "                query_embedding_response = client.embeddings.create(model=NEBIUS_EMBEDDING_MODEL, input=[query_to_use])\n",
    "                query_embedding = query_embedding_response.data[0].embedding\n",
    "                query_vector = np.array([query_embedding]).astype('float32') # FAISS needs float32 numpy array\n",
    "                \n",
    "                # 2. Perform the search in the FAISS index\n",
    "                # Ensure k is not greater than the total number of items in the index\n",
    "                actual_k = min(k_for_search, current_index.ntotal)\n",
    "                if actual_k == 0:\n",
    "                    raise ValueError(\"Index is empty or k_for_search is zero, cannot search.\")\n",
    "                \n",
    "                # `current_index.search` returns distances and indices of the nearest neighbors\n",
    "                distances, indices = current_index.search(query_vector, actual_k)\n",
    "                \n",
    "                # 3. Process retrieved indices\n",
    "                # Indices can contain -1 if fewer than 'actual_k' vectors are found (shouldn't happen with IndexFlatL2 unless k > ntotal)\n",
    "                retrieved_indices_all = indices[0]\n",
    "                valid_indices = retrieved_indices_all[retrieved_indices_all != -1].tolist()\n",
    "                \n",
    "                # 4. Apply simulated reranking (if applicable)\n",
    "                # In this simulation, we just take the top 'k_retrieve' results from the initially retrieved set.\n",
    "                # A real reranker would re-score these 'k_for_search' candidates based on relevance to the query.\n",
    "                if use_simulated_rerank:\n",
    "                    final_indices = valid_indices[:k_retrieve]\n",
    "                    # print(f\"      Rerank: Selected top {len(final_indices)} indices after simulated rerank.\") # Uncomment for verbose logging\n",
    "                else:\n",
    "                    final_indices = valid_indices # Use all valid retrieved indices up to k_retrieve\n",
    "                \n",
    "                result['retrieved_indices'] = final_indices\n",
    "                \n",
    "                # 5. Get the actual text chunks corresponding to the final indices\n",
    "                retrieved_chunks = [current_chunks[i] for i in final_indices]\n",
    "                \n",
    "                # Handle case where no chunks were retrieved (should be rare with valid indices)\n",
    "                if not retrieved_chunks:\n",
    "                    print(f\"      Warning: No relevant chunks found for {strategy_name} (C={chunk_size}, O={chunk_overlap}, K={k_retrieve}). Setting answer to indicate this.\")\n",
    "                    result['answer'] = \"No relevant context found in the documents based on the query.\"\n",
    "                    # Keep scores at 0.0 as no answer was generated from context\n",
    "                else:\n",
    "                    # --- Generation Step --- \n",
    "                    # Combine the retrieved chunks into a single context string\n",
    "                    context_str = \"\\n\\n\".join(retrieved_chunks)\n",
    "                    \n",
    "                    # Define the system prompt for the generation LLM\n",
    "                    sys_prompt_gen = \"You are a helpful AI assistant. Answer the user's query based strictly on the provided context. If the context doesn't contain the answer, state that clearly. Be concise.\"\n",
    "                    \n",
    "                    # Construct the user prompt including the context and the *original* query\n",
    "                    # It's important to use the original query here for generating the final answer, even if a rewritten query was used for retrieval.\n",
    "                    user_prompt_gen = f\"Context:\\n------\\n{context_str}\\n------\\n\\nQuery: {test_query}\\n\\nAnswer:\"\n",
    "                    \n",
    "                    # Make the API call to the Nebius generation model\n",
    "                    gen_response = client.chat.completions.create(\n",
    "                        model=NEBIUS_GENERATION_MODEL, \n",
    "                        messages=[\n",
    "                            {\"role\": \"system\", \"content\": sys_prompt_gen},\n",
    "                            {\"role\": \"user\", \"content\": user_prompt_gen}\n",
    "                        ],\n",
    "                        temperature=GENERATION_TEMPERATURE,\n",
    "                        max_tokens=GENERATION_MAX_TOKENS,\n",
    "                        top_p=GENERATION_TOP_P\n",
    "                    )\n",
    "                    # Extract the generated text answer\n",
    "                    generated_answer = gen_response.choices[0].message.content.strip()\n",
    "                    result['answer'] = generated_answer\n",
    "                    # Optional: print a snippet of the generated answer\n",
    "                    # print(f\"      Generated Answer: {generated_answer[:100].replace('\\n', ' ')}...\") \n",
    "\n",
    "                    # --- Evaluation Step --- \n",
    "                    # Evaluate the generated answer using Faithfulness, Relevancy, and Similarity\n",
    "                    # print(f\"      Evaluating answer... (Faithfulness, Relevancy, Similarity)\") # Uncomment for verbose logging\n",
    "                    \n",
    "                    # Prepare parameters for evaluation calls (use low temperature for deterministic scoring)\n",
    "                    eval_params = {'model': NEBIUS_EVALUATION_MODEL, 'temperature': 0.0, 'max_tokens': 10}\n",
    "                    \n",
    "                    # 1. Faithfulness Evaluation Call\n",
    "                    prompt_f = FAITHFULNESS_PROMPT.format(question=test_query, response=generated_answer, true_answer=true_answer_for_query)\n",
    "                    try:\n",
    "                        resp_f = client.chat.completions.create(messages=[{\"role\": \"user\", \"content\": prompt_f}], **eval_params)\n",
    "                        # Attempt to parse the score, clamp between 0.0 and 1.0\n",
    "                        result['faithfulness'] = max(0.0, min(1.0, float(resp_f.choices[0].message.content.strip())))\n",
    "                    except Exception as eval_e:\n",
    "                        print(f\"      Warning: Faithfulness score parsing error for {strategy_name} - {eval_e}. Score set to 0.0\")\n",
    "                        result['faithfulness'] = 0.0\n",
    "\n",
    "                    # 2. Relevancy Evaluation Call\n",
    "                    prompt_r = RELEVANCY_PROMPT.format(question=test_query, response=generated_answer)\n",
    "                    try:\n",
    "                        resp_r = client.chat.completions.create(messages=[{\"role\": \"user\", \"content\": prompt_r}], **eval_params)\n",
    "                        # Attempt to parse the score, clamp between 0.0 and 1.0\n",
    "                        result['relevancy'] = max(0.0, min(1.0, float(resp_r.choices[0].message.content.strip())))\n",
    "                    except Exception as eval_e:\n",
    "                        print(f\"      Warning: Relevancy score parsing error for {strategy_name} - {eval_e}. Score set to 0.0\")\n",
    "                        result['relevancy'] = 0.0\n",
    "                    \n",
    "                    # 3. Similarity Score Calculation\n",
    "                    result['similarity_score'] = calculate_cosine_similarity(\n",
    "                        generated_answer, \n",
    "                        true_answer_for_query, \n",
    "                        client, \n",
    "                        NEBIUS_EMBEDDING_MODEL\n",
    "                    )\n",
    "                    \n",
    "                    # 4. Calculate Average Score (Faithfulness, Relevancy, Similarity)\n",
    "                    result['avg_score'] = (result['faithfulness'] + result['relevancy'] + result['similarity_score']) / 3.0\n",
    "            \n",
    "            except Exception as e:\n",
    "                # Catch any unexpected errors during the retrieve/generate/evaluate process\n",
    "                error_message = f\"ERROR during {strategy_name} (C={chunk_size}, O={chunk_overlap}, K={k_retrieve}): {str(e)[:200]}...\"\n",
    "                print(f\"    {error_message}\")\n",
    "                result['answer'] = error_message # Store the error in the answer field\n",
    "                # Ensure scores remain at their default error state (0.0)\n",
    "                result['faithfulness'] = 0.0\n",
    "                result['relevancy'] = 0.0\n",
    "                result['similarity_score'] = 0.0\n",
    "                result['avg_score'] = 0.0\n",
    "            \n",
    "            # Record the total time taken for this run\n",
    "            run_end_time = time.time()\n",
    "            result['time_sec'] = run_end_time - run_start_time\n",
    "            \n",
    "            # Print a summary line for this run (useful for monitoring progress)\n",
    "            print(f\"    Finished: {strategy_name} (C={chunk_size}, O={chunk_overlap}, K={k_retrieve}). AvgScore={result['avg_score']:.2f}, Time={result['time_sec']:.2f}s\")\n",
    "            return result\n",
    "        # --- End of run_and_evaluate nested function ---\n",
    "\n",
    "        # --- Execute the RAG Strategies using the run_and_evaluate function --- \n",
    "        \n",
    "        # Strategy 1: Simple RAG (Use original query for retrieval)\n",
    "        result_simple = run_and_evaluate(\"Simple RAG\", test_query, top_k)\n",
    "        all_results.append(result_simple)\n",
    "\n",
    "        # Strategy 2: Query Rewrite RAG \n",
    "        rewritten_q = test_query # Default to original query if rewrite fails\n",
    "        try:\n",
    "             # print(\"    Attempting query rewrite for Rewrite RAG...\") # Uncomment for verbose logging\n",
    "             # Define prompts for the query rewriting task\n",
    "             sys_prompt_rw = \"You are an expert query optimizer. Rewrite the user's query to be ideal for vector database retrieval. Focus on key entities, concepts, and relationships. Remove conversational fluff. Output ONLY the rewritten query text.\"\n",
    "             user_prompt_rw = f\"Original Query: {test_query}\\n\\nRewritten Query:\"\n",
    "             \n",
    "             # Call the LLM to rewrite the query\n",
    "             resp_rw = client.chat.completions.create(\n",
    "                 model=NEBIUS_GENERATION_MODEL, # Can use the generation model for this task too\n",
    "                 messages=[\n",
    "                     {\"role\": \"system\", \"content\": sys_prompt_rw},\n",
    "                     {\"role\": \"user\", \"content\": user_prompt_rw}\n",
    "                 ],\n",
    "                 temperature=0.1, # Low temp for focused rewrite\n",
    "                 max_tokens=100, \n",
    "                 top_p=0.9\n",
    "             )\n",
    "             # Clean up the LLM's response to get just the query text\n",
    "             candidate_q = resp_rw.choices[0].message.content.strip()\n",
    "             # Remove potential prefixes like \"Rewritten Query:\" or \"Query:\"\n",
    "             candidate_q = re.sub(r'^(rewritten query:|query:)\\s*', '', candidate_q, flags=re.IGNORECASE).strip('\"')\n",
    "             \n",
    "             # Use the rewritten query only if it's reasonably different and not too short\n",
    "             if candidate_q and len(candidate_q) > 5 and candidate_q.lower() != test_query.lower(): \n",
    "                 rewritten_q = candidate_q\n",
    "                 # print(f\"      Using rewritten query: '{rewritten_q}'\") # Uncomment for verbose logging\n",
    "             # else: \n",
    "                 # print(\"      Rewrite failed, too short, or same as original. Using original query.\") # Uncomment for verbose logging\n",
    "        except Exception as e:\n",
    "             print(f\"    Warning: Error during query rewrite: {e}. Using original query.\")\n",
    "             rewritten_q = test_query # Fallback to original query on error\n",
    "             \n",
    "        # Run evaluation using the (potentially) rewritten query for retrieval\n",
    "        result_rewrite = run_and_evaluate(\"Query Rewrite RAG\", rewritten_q, top_k)\n",
    "        all_results.append(result_rewrite)\n",
    "\n",
    "        # Strategy 3: Rerank RAG (Simulated)\n",
    "        # Use original query for retrieval, but simulate the reranking process\n",
    "        result_rerank = run_and_evaluate(\"Rerank RAG (Simulated)\", test_query, top_k, use_simulated_rerank=True)\n",
    "        all_results.append(result_rerank)\n",
    "\n",
    "    print(\"\\n=== RAG Experiment Loop Finished ===\")\n",
    "    print(\"-\" * 25)"
   ]
  },
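  {
   "cell_type": "markdown",
   "id": "score-parsing-sketch-v5",
   "metadata": {},
   "source": [
    "The evaluation step in the loop above converts the evaluator's reply with `float(...)`, so a reply containing any extra text raises an exception and the score falls back to 0.0. A more tolerant parser might look like the sketch below. `parse_score` and `average_score` are hypothetical helpers, not part of this notebook, and taking the first number in the reply is an assumption that may not suit every prompt.\n",
    "\n",
    "```python\n",
    "import re\n",
    "\n",
    "def parse_score(raw_text, default=0.0):\n",
    "    # Take the first number found in the reply, e.g. '0.9' or 'Score: 0.85'.\n",
    "    match = re.search(r'-?[0-9]+(?:[.][0-9]+)?', raw_text)\n",
    "    if match is None:\n",
    "        return default  # No number at all: fall back to the default score\n",
    "    # Clamp to the 0.0-1.0 range, as the notebook does.\n",
    "    return max(0.0, min(1.0, float(match.group())))\n",
    "\n",
    "def average_score(faithfulness, relevancy, similarity):\n",
    "    # Unweighted mean of the three metrics, matching the avg_score formula in the loop.\n",
    "    return (faithfulness + relevancy + similarity) / 3.0\n",
    "\n",
    "print(parse_score('0.9'))\n",
    "print(parse_score('Score: 0.85'))\n",
    "print(parse_score('n/a'))\n",
    "print(average_score(0.9, 1.0, 0.798251))\n",
    "```\n",
    "\n",
    "Swapping something like this into the two evaluation calls would separate a badly formatted reply from a genuine API failure, at the cost of occasionally picking up an unrelated number from a verbose reply."
   ]
  },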
  {
   "cell_type": "markdown",
   "id": "analysis-v5",
   "metadata": {},
   "source": [
    "### 9. Analysis: Reviewing the Results\n",
    "\n",
    "Now that the experiment loop has completed and `all_results` contains the data from each run, we'll use the Pandas library to analyze the findings.\n",
    "\n",
    "1.  **Create DataFrame:** Convert the list of result dictionaries (`all_results`) into a Pandas DataFrame for easy manipulation and viewing.\n",
    "2.  **Sort Results:** Sort the DataFrame by the `avg_score` (the average of Faithfulness, Relevancy, and Similarity) in descending order, so the best-performing configurations appear first.\n",
    "3.  **Display Top Configurations:** Show the top N rows of the sorted DataFrame, including key parameters, scores, and the generated answer, to quickly identify promising settings.\n",
    "4.  **Summarize Best Run:** Print a clear summary of the single best-performing configuration based on the average score, showing its parameters, individual scores, time taken, and the full answer it generated."
   ]
  },
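  {
   "cell_type": "markdown",
   "id": "analysis-sketch-v5",
   "metadata": {},
   "source": [
    "As a rough illustration of the sort-and-select steps listed above, here is a minimal pandas sketch. It uses three rows copied from the results table in this notebook rather than the full `all_results` list, and the names `rows`, `sorted_df`, and `best` are chosen for this example only.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Three rows from the results table, deliberately out of order; the notebook uses all_results instead.\n",
    "rows = [\n",
    "    {'chunk_size': 150, 'overlap': 30, 'top_k': 3, 'strategy': 'Rerank RAG (Simulated)',\n",
    "     'avg_score': 0.894125, 'time_sec': 9.526656},\n",
    "    {'chunk_size': 250, 'overlap': 50, 'top_k': 3, 'strategy': 'Simple RAG',\n",
    "     'avg_score': 0.899417, 'time_sec': 8.975824},\n",
    "    {'chunk_size': 250, 'overlap': 50, 'top_k': 3, 'strategy': 'Query Rewrite RAG',\n",
    "     'avg_score': 0.896859, 'time_sec': 6.550637},\n",
    "]\n",
    "\n",
    "results_df = pd.DataFrame(rows)\n",
    "\n",
    "# Highest average score first; reset the index so row 0 is the best run.\n",
    "sorted_df = results_df.sort_values(by='avg_score', ascending=False).reset_index(drop=True)\n",
    "\n",
    "# Show the top N rows (N=10 here; fewer are printed if the frame is smaller).\n",
    "print(sorted_df.head(10))\n",
    "\n",
    "# The single best configuration is the first row after sorting.\n",
    "best = sorted_df.iloc[0]\n",
    "print(best['strategy'], best['chunk_size'], best['overlap'], best['top_k'])\n",
    "```\n",
    "\n",
    "With scores this close together, ranking by `avg_score` alone is sensitive to small evaluator and embedding variations, so `time_sec` can serve as a practical tiebreaker."
   ]
  },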
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "compare-results-v5",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--- Analyzing Experiment Results ---\n",
      "Total results collected: 24\n",
      "\n",
      "--- Top 10 Performing Configurations (Sorted by Average Score) ---\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>chunk_size</th>\n",
       "      <th>overlap</th>\n",
       "      <th>top_k</th>\n",
       "      <th>strategy</th>\n",
       "      <th>avg_score</th>\n",
       "      <th>faithfulness</th>\n",
       "      <th>relevancy</th>\n",
       "      <th>similarity_score</th>\n",
       "      <th>time_sec</th>\n",
       "      <th>answer</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>250</td>\n",
       "      <td>50</td>\n",
       "      <td>3</td>\n",
       "      <td>Simple RAG</td>\n",
       "      <td>0.899417</td>\n",
       "      <td>0.9</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.798251</td>\n",
       "      <td>8.975824</td>\n",
       "      <td>Solar power and hydropower differ significantly in consistency and environmental impact:\\n\\n- **Consistency**:  \\n  - **Solar Power**: Inconsisten...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>250</td>\n",
       "      <td>50</td>\n",
       "      <td>3</td>\n",
       "      <td>Query Rewrite RAG</td>\n",
       "      <td>0.896859</td>\n",
       "      <td>0.9</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.790578</td>\n",
       "      <td>6.550637</td>\n",
       "      <td>**Consistency:**\\n- **Hydropower** is highly reliable and provides consistent, large-scale power, as it is not dependent on weather conditions onc...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>150</td>\n",
       "      <td>30</td>\n",
       "      <td>3</td>\n",
       "      <td>Rerank RAG (Simulated)</td>\n",
       "      <td>0.894125</td>\n",
       "      <td>0.9</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.782374</td>\n",
       "      <td>9.526656</td>\n",
       "      <td>**Consistency:**  \\n- **Hydropower** is highly reliable and consistent, providing large-scale power 24/7, as it is not dependent on weather or tim...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>150</td>\n",
       "      <td>50</td>\n",
       "      <td>3</td>\n",
       "      <td>Query Rewrite RAG</td>\n",
       "      <td>0.893823</td>\n",
       "      <td>0.9</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.781468</td>\n",
       "      <td>9.675948</td>\n",
       "      <td>**Consistency:**  \\n- **Hydropower** is highly reliable and provides consistent, large-scale power, as it is not dependent on weather conditions o...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>150</td>\n",
       "      <td>30</td>\n",
       "      <td>3</td>\n",
       "      <td>Query Rewrite RAG</td>\n",
       "      <td>0.893666</td>\n",
       "      <td>0.9</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.780997</td>\n",
       "      <td>10.357061</td>\n",
       "      <td>**Consistency:**\\n- **Hydropower** is highly reliable and consistent, providing large-scale power 24/7, as it relies on the continuous flow of wat...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>150</td>\n",
       "      <td>50</td>\n",
       "      <td>3</td>\n",
       "      <td>Simple RAG</td>\n",
       "      <td>0.892774</td>\n",
       "      <td>0.9</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.778321</td>\n",
       "      <td>9.777294</td>\n",
       "      <td>**Consistency:**  \\n- **Hydropower** is highly consistent and reliable, providing large-scale power 24/7, especially with dams that store water fo...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>250</td>\n",
       "      <td>50</td>\n",
       "      <td>3</td>\n",
       "      <td>Rerank RAG (Simulated)</td>\n",
       "      <td>0.891570</td>\n",
       "      <td>0.9</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.774709</td>\n",
       "      <td>41.228211</td>\n",
       "      <td>**Consistency:**  \\n- **Hydropower** is highly reliable and provides consistent, large-scale power, especially with dams that can store water and ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>250</td>\n",
       "      <td>30</td>\n",
       "      <td>3</td>\n",
       "      <td>Query Rewrite RAG</td>\n",
       "      <td>0.890878</td>\n",
       "      <td>0.9</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.772635</td>\n",
       "      <td>8.359087</td>\n",
       "      <td>**Consistency:**  \\n- **Hydropower** is highly reliable and provides consistent, large-scale power, especially with dams that can store water and ...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>250</td>\n",
       "      <td>30</td>\n",
       "      <td>5</td>\n",
       "      <td>Simple RAG</td>\n",
       "      <td>0.890867</td>\n",
       "      <td>0.9</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.772601</td>\n",
       "      <td>8.767287</td>\n",
       "      <td>**Consistency:**  \\n- **Solar Power:** Inconsistent due to dependence on weather and daylight. Requires storage solutions (e.g., batteries) for re...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>150</td>\n",
       "      <td>50</td>\n",
       "      <td>5</td>\n",
       "      <td>Simple RAG</td>\n",
       "      <td>0.890656</td>\n",
       "      <td>0.9</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.771967</td>\n",
       "      <td>9.743746</td>\n",
       "      <td>**Consistency:**  \\n- **Solar Power:** Inconsistent due to dependence on weather and daylight. Requires storage solutions (e.g., batteries) for re...</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   chunk_size  overlap  top_k                strategy  avg_score  \\\n",
       "0         250       50      3              Simple RAG   0.899417   \n",
       "1         250       50      3       Query Rewrite RAG   0.896859   \n",
       "2         150       30      3  Rerank RAG (Simulated)   0.894125   \n",
       "3         150       50      3       Query Rewrite RAG   0.893823   \n",
       "4         150       30      3       Query Rewrite RAG   0.893666   \n",
       "5         150       50      3              Simple RAG   0.892774   \n",
       "6         250       50      3  Rerank RAG (Simulated)   0.891570   \n",
       "7         250       30      3       Query Rewrite RAG   0.890878   \n",
       "8         250       30      5              Simple RAG   0.890867   \n",
       "9         150       50      5              Simple RAG   0.890656   \n",
       "\n",
       "   faithfulness  relevancy  similarity_score   time_sec  \\\n",
       "0           0.9        1.0          0.798251   8.975824   \n",
       "1           0.9        1.0          0.790578   6.550637   \n",
       "2           0.9        1.0          0.782374   9.526656   \n",
       "3           0.9        1.0          0.781468   9.675948   \n",
       "4           0.9        1.0          0.780997  10.357061   \n",
       "5           0.9        1.0          0.778321   9.777294   \n",
       "6           0.9        1.0          0.774709  41.228211   \n",
       "7           0.9        1.0          0.772635   8.359087   \n",
       "8           0.9        1.0          0.772601   8.767287   \n",
       "9           0.9        1.0          0.771967   9.743746   \n",
       "\n",
       "                                                                                                                                                  answer  \n",
       "0  Solar power and hydropower differ significantly in consistency and environmental impact:\\n\\n- **Consistency**:  \\n  - **Solar Power**: Inconsisten...  \n",
       "1  **Consistency:**\\n- **Hydropower** is highly reliable and provides consistent, large-scale power, as it is not dependent on weather conditions onc...  \n",
       "2  **Consistency:**  \\n- **Hydropower** is highly reliable and consistent, providing large-scale power 24/7, as it is not dependent on weather or tim...  \n",
       "3  **Consistency:**  \\n- **Hydropower** is highly reliable and provides consistent, large-scale power, as it is not dependent on weather conditions o...  \n",
       "4  **Consistency:**\\n- **Hydropower** is highly reliable and consistent, providing large-scale power 24/7, as it relies on the continuous flow of wat...  \n",
       "5  **Consistency:**  \\n- **Hydropower** is highly consistent and reliable, providing large-scale power 24/7, especially with dams that store water fo...  \n",
       "6  **Consistency:**  \\n- **Hydropower** is highly reliable and provides consistent, large-scale power, especially with dams that can store water and ...  \n",
       "7  **Consistency:**  \\n- **Hydropower** is highly reliable and provides consistent, large-scale power, especially with dams that can store water and ...  \n",
       "8  **Consistency:**  \\n- **Solar Power:** Inconsistent due to dependence on weather and daylight. Requires storage solutions (e.g., batteries) for re...  \n",
       "9  **Consistency:**  \\n- **Solar Power:** Inconsistent due to dependence on weather and daylight. Requires storage solutions (e.g., batteries) for re...  "
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "--- Best Configuration Summary ---\n",
      "Chunk Size: 250 words\n",
      "Overlap: 50 words\n",
      "Top-K Retrieved: 3 chunks\n",
      "Strategy: Simple RAG\n",
      "---> Average Score (Faith+Rel+Sim): 0.899\n",
      "      (Faithfulness: 0.900, Relevancy: 1.000, Similarity: 0.798)\n",
      "Time Taken: 8.98 seconds\n",
      "\n",
      "Best Answer Generated:\n",
      "Solar power and hydropower differ significantly in consistency and environmental impact:\n",
      "\n",
      "- **Consistency**:  \n",
      "  - **Solar Power**: Inconsistent, as it depends on weather conditions and time of day. Requires storage solutions (like batteries) for reliable supply.  \n",
      "  - **Hydropower**: Highly consistent, providing large-scale, reliable power 24/7, especially with dams.  \n",
      "\n",
      "- **Environmental Impact**:  \n",
      "  - **Solar Power**: Clean with minimal emissions during operation, but manufacturing panels and disposal can have environmental impacts.  \n",
      "  - **Hydropower**: Large dams can severely harm ecosystems, disrupt fish migration, and displace communities. Run-of-river systems are less disruptive but still impact local environments.  \n",
      "\n",
      "In summary, hydropower is more consistent but has greater environmental risks, while solar power is cleaner but less reliable without storage.\n",
      "\n",
      "--- Analysis Complete --- \n"
     ]
    }
   ],
   "source": [
    "print(\"--- Analyzing Experiment Results ---\")\n",
    "\n",
    "# First, check if any results were actually collected\n",
    "if not all_results:\n",
    "    print(\"No results were generated during the experiment. Cannot perform analysis.\")\n",
    "else:\n",
    "    # Convert the list of result dictionaries into a Pandas DataFrame\n",
    "    results_df = pd.DataFrame(all_results)\n",
    "    print(f\"Total results collected: {len(results_df)}\")\n",
    "    \n",
    "    # Sort the DataFrame based on the 'avg_score' column in descending order (best first)\n",
    "    # Use reset_index(drop=True) to get a clean 0-based index after sorting.\n",
    "    results_df_sorted = results_df.sort_values(by='avg_score', ascending=False).reset_index(drop=True)\n",
    "    \n",
    "    print(\"\\n--- Top 10 Performing Configurations (Sorted by Average Score) ---\")\n",
    "    # Define the columns we want to display in the summary table\n",
    "    display_cols = [\n",
    "        'chunk_size', 'overlap', 'top_k', 'strategy', \n",
    "        'avg_score', 'faithfulness', 'relevancy', 'similarity_score', # Added similarity\n",
    "        'time_sec', \n",
    "        'answer' # Including the answer helps qualitatively assess the best runs\n",
    "    ]\n",
    "    # Filter out any columns that might not exist (e.g., if an error occurred before population)\n",
    "    display_cols = [col for col in display_cols if col in results_df_sorted.columns]\n",
    "    \n",
    "    # Display the head (top 10 rows) of the sorted DataFrame using the selected columns\n",
    "    # The display() function provides richer output in Jupyter environments.\n",
    "    display(results_df_sorted[display_cols].head(10))\n",
    "    \n",
    "    # --- Summary of the Single Best Run --- \n",
    "    print(\"\\n--- Best Configuration Summary ---\")\n",
    "    # Check if the sorted DataFrame is not empty before accessing the first row\n",
    "    if not results_df_sorted.empty:\n",
    "        # Get the first row (index 0), which corresponds to the best score after sorting\n",
    "        best_run = results_df_sorted.iloc[0]\n",
    "        \n",
    "        # Print the parameters and results of the best configuration\n",
    "        print(f\"Chunk Size: {best_run.get('chunk_size', 'N/A')} words\")\n",
    "        print(f\"Overlap: {best_run.get('overlap', 'N/A')} words\")\n",
    "        print(f\"Top-K Retrieved: {best_run.get('top_k', 'N/A')} chunks\")\n",
    "        print(f\"Strategy: {best_run.get('strategy', 'N/A')}\")\n",
    "        # Use .get(col, default) for robustness in case a column is missing\n",
    "        avg_score = best_run.get('avg_score', 0.0)\n",
    "        faithfulness = best_run.get('faithfulness', 0.0)\n",
    "        relevancy = best_run.get('relevancy', 0.0)\n",
    "        similarity = best_run.get('similarity_score', 0.0)\n",
    "        time_sec = best_run.get('time_sec', 0.0)\n",
    "        best_answer = best_run.get('answer', 'N/A')\n",
    "        \n",
    "        print(f\"---> Average Score (Faith+Rel+Sim): {avg_score:.3f}\")\n",
    "        print(f\"      (Faithfulness: {faithfulness:.3f}, Relevancy: {relevancy:.3f}, Similarity: {similarity:.3f})\")\n",
    "        print(f\"Time Taken: {time_sec:.2f} seconds\")\n",
    "        print(f\"\\nBest Answer Generated:\")\n",
    "        # Print the full answer generated by the best configuration\n",
    "        print(best_answer)\n",
    "    else:\n",
    "        # Handle the case where no results were successfully processed\n",
    "        print(\"Could not determine the best configuration (no valid results found).\")\n",
    "        \n",
    "print(\"\\n--- Analysis Complete --- \")"
   ]
  },
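  {
   "cell_type": "markdown",
   "id": "avg-score-check",
   "metadata": {},
   "source": [
    "As a quick check on how `avg_score` is built, the best run above can be recomputed by hand. This assumes the three metrics are weighted equally, which matches the reported numbers; $F$, $R$, and $S$ are shorthand introduced here for Faithfulness, Relevancy, and Similarity.\n",
    "\n",
    "$$\\bar{s} = \\frac{F + R + S}{3} = \\frac{0.9 + 1.0 + 0.798251}{3} = \\frac{2.698251}{3} \\approx 0.899417$$\n",
    "\n",
    "In the ten rows shown, Faithfulness and Relevancy are identical (0.9 and 1.0), so the ranking among them is decided entirely by the similarity score."
   ]
  },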
  {
   "cell_type": "markdown",
   "id": "conclusion-v5",
   "metadata": {},
   "source": [
    "### 10. Conclusion: What Did We Learn?\n",
    "\n",
    "We have successfully constructed and executed an end-to-end pipeline to experiment with various RAG configurations and evaluate their performance using multiple metrics on the Nebius AI platform.\n",
    "\n",
    "By examining the results table and the best configuration summary above, we can gain insights specific to *our chosen corpus, query, and models*.\n",
    "\n",
    "**Reflection Points:**\n",
    "\n",
    "*   **Chunking Impact:** Did a specific `chunk_size` or `overlap` tend to produce better average scores? Consider why smaller chunks might capture specific facts better, while larger chunks might provide more context. How did overlap seem to influence the results?\n",
    "*   **Retrieval Quantity (`top_k`):** How did increasing `top_k` affect the scores? Did retrieving more chunks always lead to better answers, or did it sometimes introduce noise or irrelevant information, potentially lowering faithfulness or similarity?\n",
    "*   **Strategy Comparison:** Did the 'Query Rewrite' or 'Rerank (Simulated)' strategies offer a consistent advantage over 'Simple RAG' in terms of the average score? Was the potential improvement significant enough to justify the extra steps (e.g., additional LLM call for rewrite, larger initial retrieval for rerank)?\n",
    "*   **Evaluation Metrics:** \n",
    "    *   Look at the 'Best Answer' and compare it to the `true_answer_for_query`. Do the individual scores (Faithfulness, Relevancy, Similarity) seem to reflect the quality you perceive?\n",
    "    *   Did high similarity always correlate with high faithfulness? Could an answer be similar but unfaithful, or faithful but dissimilar? \n",
    "    *   How reliable do you feel the automated LLM evaluation (Faithfulness, Relevancy) is compared to the more objective Cosine Similarity? What are the potential limitations of LLM-based evaluation (e.g., sensitivity to prompt wording, model biases)?\n",
    "*   **Overall Performance:** Did any configuration achieve a near-perfect average score? What might be preventing a perfect score (e.g., limitations of the source documents, inherent ambiguity in language, imperfect retrieval)?\n",
    "\n",
    "**Key Takeaway:** Optimizing a RAG system is an iterative process. The best configuration often depends heavily on the specific dataset, the nature of the user queries, the chosen embedding and LLM models, and the evaluation criteria. Systematic experimentation, like the process followed in this notebook, is crucial for finding settings that perform well for a particular use case.\n",
    "\n",
    "**Potential Next Steps & Further Exploration:**\n",
    "\n",
    "*   **Expand Test Parameters:** Try a wider range of `chunk_size`, `overlap`, and `top_k` values.\n",
    "*   **Different Queries:** Test the same configurations with different types of queries (e.g., fact-based, comparison, summarization) to see how performance varies.\n",
    "*   **Larger/Different Corpus:** Use a more extensive or domain-specific knowledge base.\n",
    "*   **Implement True Reranking:** Replace the simulated reranking with a dedicated cross-encoder reranking model (e.g., from Hugging Face Transformers or Cohere Rerank) to re-score the initially retrieved documents based on relevance.\n",
    "*   **Alternative Models:** Experiment with different Nebius AI models for embedding, generation, or evaluation to see their impact.\n",
    "*   **Advanced Chunking:** Explore more sophisticated chunking strategies (e.g., recursive character splitting, semantic chunking).\n",
    "*   **Human Evaluation:** Complement the automated metrics with human judgment for a more nuanced understanding of answer quality."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv-best-rag-finder",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
